Senior Big Data Engineer Resume
Farmington Hills, MI
SUMMARY
- Over 7 years of IT experience as a Big Data Developer, Designer, and QA Engineer with cross-platform integration experience using the Hadoop ecosystem, Java, and functional automation
- Practiced Agile Scrum methodology and contributed to TDD, CI/CD, and all aspects of the SDLC
- Hands-on experience installing, configuring, and architecting Hadoop and Hortonworks clusters and services: HDFS, MapReduce, YARN, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop
- Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie
- Completed application builds for web applications, web services, Windows services, console applications, and client GUI applications.
- Experienced in troubleshooting and automated deployment to web and application servers such as WebSphere, WebLogic, JBoss, and Tomcat.
- Solid experience with cloud platforms such as Amazon Web Services and Microsoft Azure.
- Experienced in integrating deployments with multiple build systems and providing an application model that handles multiple projects.
- Hands-on experience integrating REST APIs with cloud environments to access resources.
- Developed Spark programs, created DataFrames, and worked on transformations.
- Worked on data processing, transformations, and actions in Spark using Python (PySpark); a minimal sketch of this pattern follows this summary.
- Developed a framework for converting existing PowerCenter mappings to Spark (PySpark) jobs.
- Experienced in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Set up clusters on Amazon EC2 and S3, including automation for provisioning and extending the clusters in AWS
- Experienced in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope
- Gathered and defined functional and UI requirements for software applications
- Experienced in real-time analytics with Apache Spark RDDs, DataFrames, and the Streaming API
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
- Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.
- Expert in utilizing Kafka for messaging and publish-subscribe systems.
- Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their applications (CI/CD) to deploying on public or private clouds.
- Monitored tasks and DAGs, then triggered task instances once their dependencies were complete.
- Spun up the scheduler subprocess, which monitors and stays in sync with all DAGs in the specified DAG directory; once per minute, by default, the scheduler collects DAG parsing results and checks whether any active tasks can be triggered.
- Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
- Practiced DevOps for microservices using Kubernetes as the orchestrator.
- Created templates and wrote shell scripts (Bash), Ruby, Python, and PowerShell for automating tasks.
- Good knowledge and hands-on experience with monitoring tools such as Splunk and Nagios.
- Knowledge of protocols and services such as FTP, SSH, HTTP, HTTPS, TCP/IP, DNS, VPNs, and firewall groups.
- Responsible for writing MapReduce programs.
- Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase
- Experienced in developing Java UDFs for Hive and Pig
- Experienced with NoSQL databases such as HBase, MongoDB, and Cassandra; wrote advanced queries and sub-queries
- Able to run multiple schedulers concurrently for performance, efficiency, and resiliency.
- Developed mappings in Informatica to load data, including facts and dimensions, from various sources into the data warehouse, using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Loaded flat-file data into the staging area using Informatica.
- Reviewed existing code and led efforts to tweak and tune the performance of existing Informatica processes
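The following is a minimal PySpark sketch of the DataFrame transformations and actions referenced above. The input path, column names, and aggregation are hypothetical placeholders rather than details from any specific project.

```python
from pyspark.sql import SparkSession, functions as F

# Minimal sketch, assuming a Spark cluster with HDFS access; the path and
# column names below are hypothetical placeholders.
spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Load usage records from a Parquet source (hypothetical path).
usage = spark.read.parquet("hdfs:///data/usage/")

# Transformations: filter bad rows, derive a column, aggregate per customer.
summary = (
    usage
    .filter(F.col("duration_sec") > 0)
    .withColumn("duration_min", F.col("duration_sec") / 60)
    .groupBy("customer_id")
    .agg(
        F.count("*").alias("sessions"),
        F.sum("duration_min").alias("total_minutes"),
    )
)

# Action: materialize and inspect the heaviest users.
summary.orderBy(F.desc("total_minutes")).show(10)
```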
TECHNICAL SKILLS
Hadoop/Big Data: Hadoop, MapReduce, HDFS, ZooKeeper, Kafka, Hive, Pig, Sqoop, Oozie, Flume, YARN, HBase, Spark with Scala
NoSQL Databases: HBase, Cassandra, MongoDB
Languages: Java, Python, Scala, PySpark, UNIX shell scripts
Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL
Frameworks: Spring, Hibernate
Operating Systems: Red Hat Linux, Ubuntu Linux, and Windows XP/Vista/7/8
Web/Application servers: Apache Tomcat, WebLogic, JBoss
Databases: SQL Server, MySQL
IDE: Eclipse, IntelliJ
PROFESSIONAL EXPERIENCE
Senior Big Data Engineer
Confidential, Farmington Hills, MI
Responsibilities:
- Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
- Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
- Created DAX queries to generate computed columns in Power BI.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in various storage formats such as text, JSON, and Parquet.
- Monitored cluster health by setting up alerts using Nagios and Ganglia.
- Worked on tickets opened by users regarding various incidents and requests.
- Implemented Kafka producer and consumer applications on a Kafka cluster set up with ZooKeeper.
- Used the Spring Kafka API to process messages smoothly on the Kafka cluster.
- Involved in all steps and the scope of the project's reference-data approach to MDM; created a data dictionary and mappings from sources to targets in the MDM data model.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2)
- Experienced in managing Azure Data Lake Store (ADLS) and Data Lake Analytics and in integrating with other Azure services; knowledge of U-SQL
- Responsible for working with various teams on a project to develop an analytics-based solution to specifically target customer subscribers.
- Created and maintained SQL Server scheduled jobs that executed stored procedures to extract data from Oracle into SQL Server; extensively used Tableau for customer marketing data visualization
- Optimized the algorithm with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into the target system from multiple sources
- Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a given time frame (a minimal PySpark ML sketch follows this list).
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
- Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap; installed, configured, and maintained data pipelines.
- Designed the business-requirement collection approach based on the project scope and SDLC methodology.
- Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
- Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
- Built a real-time pipeline for streaming data using Kafka and Spark Streaming (see the streaming sketch after this list).
- Involved in unit testing the code and provided feedback to the developers; performed unit testing of the application using NUnit.
- Generated ad-hoc reports in Excel Power Pivot and shared them with decision makers via Power BI for strategic planning.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management; worked closely with regulatory delivery leads to ensure robustness in prop-trading control frameworks using Hadoop, Python Jupyter Notebook, Hive, and NoSQL.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries.
- Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python, R, and MATLAB; collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
- Migrated ETL processes from RDBMS to Hive to test easy data manipulation.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of Git repositories and access control strategies.
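A minimal sketch of the Kafka-to-Spark streaming pipeline mentioned above, written against the Structured Streaming API (the project may have used the older DStream-based API instead). The broker address, topic, message schema, and HDFS paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema for the JSON messages on the topic.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw byte stream from Kafka (hypothetical broker and topic).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; cast the value to string and parse the JSON payload.
parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", schema).alias("e"))
    .select("e.*")
)

# Write the parsed stream to HDFS as Parquet with checkpointing (hypothetical paths).
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streaming/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```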
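A minimal PySpark ML sketch of a binary classification model along the lines described above. The input path, feature columns, and label name are hypothetical placeholders, and the production models were built from scratch rather than with this exact pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("binary-classifier-sketch").getOrCreate()

# Hypothetical training data with a binary "label" column.
df = spark.read.parquet("hdfs:///data/model_input/")

# Assemble hypothetical feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, lr]).fit(train)

# Evaluate with area under the ROC curve on the held-out split.
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(f"AUC = {auc:.3f}")
```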
Environment: Spark Streaming, Hive, Scala, Hadoop, Kafka, Spark, Sqoop, Spark SQL, TDD, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Data Storage, Zookeeper, Power BI, Azure SQL, Databricks, HDInsight, Unix/Linux Shell Scripting, Python, PyCharm, Informatica PowerCenter, Linux.
Senior Big Data Engineer
Confidential, Plano, TX
Responsibilities:
- Installed/Configured/Maintained Apache Hadoop clusters for application development based on the requirements.
- Created data pipelines for different events to load data from DynamoDB into an AWS S3 bucket and then into an HDFS location
- Involved in importing real-time data into Hadoop using Kafka and implemented the Oozie job for daily imports.
- Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
- Developed Spark programs with Scala and applied principles of functional programming to do batch processing.
- Wrote Spark Core programs for processing and cleansing data, then loaded that data into Hive or HBase for further processing.
- Implemented automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Developed an application to clean semi-structured data such as JSON/XML into structured files before ingesting them into HDFS.
- Built a real-time pipeline for streaming data using Kafka and Spark Streaming.
- Wrote HiveQL as per requirements, processed data in the Spark engine, and stored the results in Hive tables.
- Imported existing datasets from Oracle into the Hadoop system using Sqoop.
- Responsible for importing data from Postgres into HDFS and Hive using the Sqoop tool.
- Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
- Responsible for performing extensive data validation using Hive.
- Created Sqoop jobs and Hive scripts for data ingestion from relational databases to compare with historical data.
- Developed a framework for automated data ingestion from different sources (relational databases, delimited files, JSON files, XML files) into HDFS and built Hive/Impala tables on top of them.
- Developed real-time data ingestion application using Flume and Kafka.
- Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances
- Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
- Created Hive tables on HDFS to store data processed by Apache Spark on the Cloudera Hadoop cluster in Parquet format.
- Developed a tool in Scala and Apache Spark to load S3 JSON files into Hive tables in Parquet format.
- Wrote a tool in Scala and Akka that scrubs numerous files in Amazon S3, removing unwanted characters, among other activities.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing.
- Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Developed Spark scripts using Scala shell commands as per requirements.
- Developed Spark code in Python (PySpark) and Spark SQL for faster testing and processing of data
- Developed Hive UDFs in Java as per business requirements
- Developed a Spark job in Java that indexes data into Elasticsearch from external Hive tables stored in HDFS.
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed (see the sketch after this list)
- Automated data ingestion using Oozie workflows and scheduled jobs using the Control-M scheduler
- Used Bitbucket as the code repository and frequently used Git commands to clone, push, and pull code from the Git repository
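A minimal PySpark sketch of the EMR processing pattern referenced above: read raw JSON from S3, apply transformations, and write a partitioned Parquet table registered in Hive. The bucket, paths, schema fields, and table name are hypothetical placeholders, and the actual STM-driven transformations were more involved than this.

```python
from pyspark.sql import SparkSession, functions as F

# Sketch assumes an EMR/Spark cluster with S3 access and Hive support enabled.
spark = (
    SparkSession.builder
    .appName("s3-json-to-parquet-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw JSON events from S3 (hypothetical bucket and prefix).
events = spark.read.json("s3://example-bucket/raw/events/")

# Apply illustrative transformations: cast a timestamp, drop bad rows, project columns.
curated = (
    events
    .withColumn("event_ts", F.to_timestamp("event_time"))
    .filter(F.col("event_type").isNotNull())
    .select("event_id", "event_type", "event_ts", "payload")
)

# Write a partitioned Parquet table registered in the Hive metastore
# (hypothetical database and table name).
(
    curated.write
    .mode("append")
    .format("parquet")
    .partitionBy("event_type")
    .saveAsTable("analytics.curated_events")
)
```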
Environment: Cloudera, Hive, Impala, Spark, Apache Kafka, Flume, Scala, AWS, EC2, S3, DynamoDB, Auto Scaling, Lambda, NiFi, Snowflake, Java, Shell Scripting, SQL, Sqoop, Oozie, PL/SQL, Oracle 12c, SQL Server, HBase, Bitbucket, Control-M, Python
Big Data Developer
Confidential, Alpharetta, GA
Responsibilities:
- Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
- Developed Airflow DAGs in Python by importing the Airflow libraries (a minimal DAG sketch follows this list).
- Used Elasticsearch for indexing and full-text search.
- Designed column families in Cassandra; ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
- Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.
- Used Flume to collect, aggregate, and store web log data from different sources such as web servers and mobile and network devices, and pushed it into HDFS.
- Wrote SQL queries against Snowflake.
- Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
- Created PySpark code that uses Spark SQL to generate data frames from the Avro-formatted raw layer and writes them to data-service-layer internal tables in ORC format.
- In charge of PySpark code that creates dataframes from tables in the data service layer and writes them to a Hive data warehouse.
- Performed Data Preparation by using Pig Latin to get the right data format needed.
- Used Python pandas, NiFi, Jenkins, NLTK, and TextBlob to complete the ETL process of clinical data for future NLP analysis.
- Expertise in writing Hadoop jobs for analyzing data using HiveQL, Pig Latin (a data flow language), and custom MapReduce programs in Java.
- Worked with various HDFS file formats such as Avro, SequenceFile, and JSON, and compression formats such as Snappy and bzip2.
- Hands-on experience using AWS services such as EC2, S3, Auto Scaling, and DynamoDB, along with MongoDB, NiFi, and Talend
- Created session beans and controller servlets for handling HTTP requests from Talend
- Developed a PySpark program that writes dataframes to HDFS as Avro files.
- Used Hive to implement a data warehouse and stored data in HDFS; stored data in Hadoop clusters set up in AWS EMR.
- Designed and implemented effective analytics solutions and models with Snowflake.
- Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
- Worked on analyzing the Hadoop cluster and different big data analytics tools, including Pig and Hive.
- Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
- Used the Spark Streaming API to perform necessary transformations and actions on data received from Kafka.
- Experience moving raw data between different systems using Apache NiFi.
- Involved in loading data from the UNIX file system to HDFS using shell scripting.
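A minimal Airflow DAG sketch (Airflow 2.x style) of the kind of workflow described above. The DAG id, schedule, commands, and paths are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_raw_layer_load",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Land source extracts into HDFS (placeholder command and paths).
    ingest = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="hdfs dfs -put /data/exports/*.csv /raw/incoming/",
    )

    # Build external Hive tables over the raw layer (placeholder script).
    build_raw_tables = BashOperator(
        task_id="build_raw_hive_tables",
        bash_command="hive -f /opt/etl/create_raw_tables.hql",
    )

    ingest >> build_raw_tables
```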
Environment: Hadoop, Hive, AWS, PySpark, Cloudera, MapReduce, Apache Kafka, Java, Python, Pandas, Pig, Cassandra, Jenkins, Flume, Snowflake, SQL Server, MySQL, PostgreSQL, MongoDB, DynamoDB, Airflow, Unix, Shell Scripting.
Java/ Hadoop Developer
Confidential, NYC, NY
Responsibilities:
- Involved in requirement gathering, database design, and implementation of a star-schema / snowflake-schema dimensional data warehouse using Erwin.
- Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality (a simplified Hadoop Streaming-style sketch of the MapReduce pattern follows this list).
- Developed Pig scripts to store unstructured data in HDFS.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Worked on various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Handled continuous streaming data from different sources using Flume, with HDFS as the destination.
- Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
- Responsible for coding Java batch jobs, RESTful services, MapReduce programs, and Hive queries, as well as testing, debugging, peer code review, troubleshooting, and status reporting.
- Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
- Developed Flume agents for loading and filtering streaming data into HDFS.
- Developed Pig Latin scripts to extract and filter relevant data from web server output files and load it into HDFS.
- Analyzed the data by running Hive queries and Pig scripts to study customer behavior.
- Optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms.
- Experience creating various Oozie jobs to manage processing workflows.
- Involved in creating Oozie workflow and coordinator jobs to kick off jobs on time based on data availability.
- Developed job workflows in Oozie to automate loading data into HDFS and a few other Hive jobs.
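The MapReduce programs on this project were written in Java; as a simplified illustration of the same map/reduce pattern in Python (the language used for the other sketches in this resume), a Hadoop Streaming-style count of HTTP status codes over access logs might look like the following. The log layout, field positions, and paths are hypothetical.

```python
#!/usr/bin/env python3
"""Hadoop Streaming-style sketch: count HTTP status codes in access logs.

Hypothetical invocation, running the same script as mapper and reducer:
  hadoop jar hadoop-streaming.jar \
    -mapper "log_status.py map" -reducer "log_status.py reduce" \
    -input /logs/access -output /logs/status_counts
"""
import sys


def mapper():
    # Common access-log layout: the status code is the 9th space-separated field.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) > 8:
            print(f"{fields[8]}\t1")


def reducer():
    # Input arrives sorted by key, so counts can be accumulated per status code.
    current, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value or 0)
    if current is not None:
        print(f"{current}\t{count}")


if __name__ == "__main__":
    mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()
```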
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Flume, Snowflake, Oozie, HBase, Sqoop, RDBMS/DB, Flat files, MySQL, Java.
SDET
Confidential, NYC NY
Responsibilities:
- Involved in all phases of the SDLC; executed test cases manually and logged defects using ClearQuest
- Attended requirement meetings with Business Analysts and Business Users
- Analyzed requirements and use cases; performed ambiguity reviews of business requirements and functional specification documents
- Automated the functionality and interface testing of the application using Quick Test Professional (QTP)
- Executed test cases manually and logged defects using HP Quality Center.
- Designed, developed, and maintained an automation framework (hybrid framework).
- Conducted cross-platform and cross-browser tests and executed parallel tests on various platforms
- Performed client application testing and web-based application performance, stress, volume, and load testing of the system using LoadRunner 9.5.
Environment: SQL, Oracle 10g, Apache Tomcat, HP LoadRunner, IBM Rational Robot, ClearQuest, Java, J2EE, HTML, DHTML, XML, JavaScript, Eclipse, WebLogic, and PL/SQL.