Senior Big Data Engineer Resume
Purchase, NY
SUMMARY
- 8+ years of professional IT experience in project development, implementation, deployment, and maintenance using Big Data technologies, designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, Spark integration with Cassandra, Avro, Solr, and Zookeeper.
- 7+ years of experience as a developer using Big Data technologies such as Databricks/Spark and the Hadoop ecosystem.
- Hands-on experience with Unified Data Analytics on Databricks, the Databricks Workspace user interface, managing Databricks notebooks, Delta Lake with Python, and Delta Lake with Spark SQL.
- Good understanding of Spark architecture with Databricks and Structured Streaming; set up AWS and Microsoft Azure with Databricks, used Databricks Workspace for business analytics, managed clusters in Databricks, and managed the machine learning lifecycle.
- Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS.
- Proficiency in multiple databases including MongoDB, Cassandra, MySQL, Oracle, and MS SQL Server.
- Extensive knowledge on QlikView Enterprise Management Console (QEMC), QlikView Publisher, QlikView Web Server.
- Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework within an Agile development methodology.
- Worked as team JIRA administrator providing access, working assigned tickets, and teaming with project developers to test product requirements/bugs/new improvements.
- Created Snowflake Schemas by normalizing the dimension tables as appropriate and creating a Sub Dimension named Demographic as a subset to the Customer Dimension.
- Experienced in Pivotal Cloud Foundry (PCF) on Azure VMs to manage the containers created by PCF.
- Hands-on experience in test-driven development (TDD), behavior-driven development (BDD), and acceptance-test-driven development (ATDD) approaches.
- Managed databases and Azure data platform services (Azure Data Lake Storage (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB), SQL Server, Oracle, and data warehouses; built multiple data lakes.
- Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau and Power BI.
- Worked with Google Cloud Dataflow and BigQuery to manage and move data within a 200-petabyte cloud data lake for GDPR compliance; also designed a star schema in BigQuery.
- Provided full life cycle support to logical/physical database design, schema management and deployment. Adept at database deployment phase with strict configuration management and controlled coordination with different teams.
- Experience in writing code in R and Python to manipulate data for data loads, extracts, statistical analysis, modeling, and data munging.
- Familiar with latest software development practices such as Agile Software Development, Scrum, Test Driven Development (TDD) and Continuous Integration (CI).
- Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy. Experience in creating and running Docker images with multiple microservices.
- Utilized analytical applications like R, SPSS, Rattle and Python to identify trends and relationships between different pieces of data, draw appropriate conclusions and translate analytical findings into risk management and marketing strategies that drive value.
- Extensive hands-on experience in using distributed computing architectures such as AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), Hadoop, Python, and Spark, and effective use of Azure SQL Database, MapReduce, Hive, SQL, and PySpark to solve big data problems.
- Strong experience in Microsoft Azure Machine Learning Studio for data import, export, data preparation, exploratory data analysis, summary statistics, feature engineering, Machine learning model development and machine learning model deployment into Server system.
- Proficient in statistical methodologies including hypothesis testing, ANOVA, time series, principal component analysis, factor analysis, cluster analysis, and discriminant analysis.
- Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Worked with various text analytics libraries like Word2Vec, GloVe, LDA and experienced with Hyper Parameter Tuning techniques like Grid Search, Random Search, model performance tuning using Ensembles and Deep Learning.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Knowledge of Proof of Concepts (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
- Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (a brief illustrative sketch follows this summary).
- Expertise in designing complex mappings, performance tuning, and slowly changing dimension tables and fact tables.
- Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
- Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Experience with leveraging APIs and micro-services for serving and managing data on AWS.
- Applied API management technologies such as AWS API Gateway, RESTful APIs, Route 53, AWS Lambda, and web services.
- Familiarity with automation frameworks such as JUnit, Jasmine, and EasyMock.
- Excellent communication skills; work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
- Skilled in data parsing, data ingestion, data manipulation, data architecture, data modeling, and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
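A brief illustrative sketch of the Python-for-Hive UDF pattern mentioned above, using Hive's TRANSFORM streaming interface; the script, table, and column names are hypothetical and not taken from any project on this resume.

```python
#!/usr/bin/env python
# clean_phone.py - hypothetical streaming UDF, invoked from Hive after ADD FILE clean_phone.py:
#   SELECT TRANSFORM (user_id, raw_phone)
#          USING 'python clean_phone.py' AS (user_id, phone)
#   FROM   staging.contacts;
import re
import sys

for line in sys.stdin:
    # Hive streams each row to stdin as tab-separated text
    user_id, raw_phone = line.rstrip("\n").split("\t")
    digits = re.sub(r"\D", "", raw_phone)               # strip everything but digits
    phone = digits[-10:] if len(digits) >= 10 else ""   # keep a 10-digit number or blank
    print("\t".join([user_id, phone]))                  # emit the transformed row to stdout
```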
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, HBase, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Elasticsearch, Solr, MongoDB, Cassandra, Avro, Storm, Parquet, Snappy, AWS
Machine Learning Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
IDEs: IntelliJ, Eclipse, Spyder, Jupyter, NetBeans
Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML - Scikit-Learn, MLjar, etc.
Databases & Warehouses: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase, NoSQL, MS Access, Teradata
Programming/ Query Languages: Java, SQL, Python Programming (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R Programming (Caret, Glmnet, XGBoost, rpart, ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala.
Data Engineer/Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, NiFi, GCP, Google Cloud Shell, Linux, BigQuery, Bash shell, Unix, Tableau, Power BI, SAS, Web Intelligence, Crystal Reports
Version Controllers: GIT, SVN, CVS
ETL Tools: Informatica, AB Initio, Talend
Operating Systems: UNIX, Linux, Mac OS, Windows variants
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Amazon EMR
PROFESSIONAL EXPERIENCE
Confidential - Purchase, NY
Senior Big Data Engineer
Responsibilities:
- Built scalable and reliable ETL systems to efficiently pull together large and complex data from different systems.
- Experienced in designing and deployment of Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark with Cloudera distribution.
- Used Amazon Web Services (AWS) including EC2, S3, CloudFront, Elastic File System, RDS, VPC, Direct Connect, Route 53, CloudWatch, CloudTrail, CloudFormation, and IAM, which allowed automated operations.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Hands-on experience with Cloudera Hue to import data through the GUI.
- Worked on integrating Apache Kafka with the Spark Streaming process to consume data from external REST APIs and run custom functions (an illustrative sketch follows this role).
- Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
- Developed Spark scripts by using Scala Shell commands as per the requirement.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Developed Oozie workflows to schedule and run multiple Hive and Pig jobs.
- Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
- Configured, supported, and maintained all network, firewall, storage, load balancers, operating systems, and software in AWS EC2.
- Implemented Amazon EMR for big data processing on a Hadoop cluster of virtual servers backed by Amazon EC2 and S3.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and XML.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
- Supported continuous storage in AWS using Elastic Block Store, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Implemented a generalized solution model using AWS SageMaker.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
- Created S3 buckets, managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
- Managed IAM users by creating new users, granting limited access as needed, and assigning roles and policies to specific users.
- Developed analytical components using Scala, Spark, and Spark Streaming.
- Acted as technical liaison between the customer and the team on all AWS technical aspects.
- Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
- Good knowledge of data manipulation, tombstones, and compaction in Cassandra; well experienced in avoiding faulty writes and reads in Cassandra.
- Performed data analysis with Cassandra using Hive External tables.
- Designed the Column families in Cassandra.
- Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Implemented YARN Capacity Scheduler on various environments and tuned configurations according to the application wise job loads.
- Configured the Continuous Integration system to execute suites of automated tests on desired frequencies using Jenkins, Maven, and Git.
- Experience with Agile and Scrum Methodologies. Involved in designing, creating, managing Continuous Build and Integration Environments.
Environment: Hadoop, HDFS, Hive, Spark, Cloudera, AWS EC2, AWS S3, AWS EMR, Sqoop, Kafka, YARN, Shell Scripting, Scala, Pig, Cassandra, Oozie, Agile methods, MySQL
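An illustrative sketch of the Kafka-to-Spark Streaming consumption pattern referenced in this role, written with PySpark Structured Streaming; the broker, topic, schema, and S3 paths are hypothetical placeholders, not details from the actual project.

```python
# Minimal sketch: consume JSON events from Kafka and land them in S3 as Parquet.
# Assumes the spark-sql-kafka package is available on the cluster; all names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-event-ingest").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # hypothetical broker
    .option("subscribe", "api_events")                    # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))  # Kafka value arrives as bytes
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")                 # hypothetical output location
    .option("checkpointLocation", "s3a://example-bucket/chk/events/")
    .start()
)
query.awaitTermination()
```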
Confidential - Deerfield Beach, Florida
Sr. Data Engineer
Responsibilities:
- Experienced in development using Cloudera distribution system.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Worked on creating tabular models on Azure analysis services for meeting business reporting requirements.
- Good experience working with Azure Blob and Data Lake Storage and loading data into Azure Synapse Analytics (SQL DW).
- As a Hadoop developer, responsible for managing the data pipelines and the data lake.
- Experience working with the Snowflake data warehouse.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Designed custom Spark REPL application to handle similar datasets.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Worked on analyzing the Hadoop cluster and different big data analytic tools including Pig, Hive, HBase, Spark, and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
- Developed Spark Applications by using Scala, Java and Implemented Apache Spark data processing project to handle data from various RDBMS and Streaming sources.
- Developed Spark Programs using Scala and Java API's and performed transformations and actions on RDD's.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Developed ETL processes using Spark, Scala, Hive, and HBase.
- Developed REST APIs using Scala, Play framework and Akka.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
- Assigned names to each of the columns using case classes in Scala.
- Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Developed analytical component using Scala, Spark, and Spark Stream.
- Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
- Worked on the NoSQL databases HBase and MongoDB.
- Performed validation and verification of software at all testing phases, including Functional Testing, System Integration Testing, End-to-End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing, and Pre-prod Testing.
- Good experience logging defects in Jira and Azure DevOps.
- Experienced in Installation, Configuration, and Administration of Informatica Data Quality and Informatica Data Analyst.
- Expertise in address data cleansing using Informatica Address Doctor to find deliverable personal and business addresses.
- Analyzed Data Profiling Results and Performed Various Transformations.
- Hands-on experience creating reference tables using the Informatica Analyst tool as well as the Informatica Developer tool.
- Wrote Python scripts to parse JSON documents and load the data into the database.
- Generated various capacity planning reports (graphical) using Python packages such as NumPy and Matplotlib.
- Analyzed various generated logs and predicted/forecasted the next occurrence of events with various Python libraries.
- Hands-on experience with Snowflake utilities, SnowSQL, Snowpipe, and big data modeling techniques using Python/Java.
- Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL; wrote SQL queries against Snowflake (an illustrative sketch follows this role).
- Used Python APIs for extracting daily data from multiple vendors.
Environment: Hadoop, Hive, Oozie, Java, Linux, Maven, Oracle 11g/10g, Zookeeper, MySQL, Spark, IDQ Informatica Tool 10.0, IDQ Informatica Developer Tool 9.6.1 HF3.
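A hedged sketch of the Python/SnowSQL loading pattern referenced in this role, using the Snowflake Python connector to run a COPY INTO from a stage; the account, warehouse, stage, and table names are placeholders, not details from the actual project.

```python
# Sketch: load staged files into Snowflake and check the row count with the Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",       # hypothetical account locator
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # COPY INTO pulls any new files from the external stage into the raw table;
    # Snowpipe automates the same statement on file-arrival notifications.
    cur.execute("""
        COPY INTO RAW.ORDERS
        FROM @ORDERS_STAGE
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    cur.execute("SELECT COUNT(*) FROM RAW.ORDERS")
    print("rows in RAW.ORDERS:", cur.fetchone()[0])
finally:
    conn.close()
```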
Confidential - Urbandale, Iowa
Data Engineer/ Data Scientist
Responsibilities:
- Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experienced data scientist with over one year of experience in data extraction, data modeling, data wrangling, statistical modeling, data mining, machine learning, and data visualization.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Used Sqoop to import and export data between Oracle, PostgreSQL, and HDFS for analysis.
- Migrated Existing MapReduce programs to Spark Models using Python.
- Migrated data from the data lake (Hive) into an S3 bucket.
- Performed data validation between the data present in the data lake and the S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data (an illustrative sketch follows this role).
- Designed batch processing jobs using Apache Spark, achieving roughly ten-fold speedups over the equivalent MapReduce jobs.
- Used Kafka for real time data ingestion.
- Created different topics for reading the data in Kafka.
- Read data from different topics in Kafka.
- Moved data from the S3 bucket to the Snowflake data warehouse for generating reports.
- Created database objects like stored procedures, UDFs, triggers, indexes, and views using T-SQL in both OLTP and relational data warehouses in support of ETL.
- Developed complex ETL Packages using SQL Server 2008 Integration Services to load data from various sources like Oracle/SQL Server/DB2 to Staging Database and then to Data Warehouse.
- Created report models from cubes as well as the relational data warehouse to create ad-hoc reports and chart reports.
- Written Hive queries for data analysis to meet the business requirements.
- Migrated an existing on-premises application to AWS.
- Developed PIG Latin scripts to extract the data from the web server output files and to load into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many UDFs and UDAFs for functions not pre-existing in Hive and Spark SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented different performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters such as memory, cores, and executors.
- Used the ZooKeeper implementation in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
Environment: Linux, Apache Hadoop Framework, HDFS, YARN, Hive, HBase, AWS (S3, EMR), Scala, Spark, Sqoop, MS SQL Server 2014, Teradata, ETL, SSIS, Alteryx, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), AWS Redshift, Spark (PySpark, MLlib, Spark SQL).
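An illustrative PySpark sketch of the Spark DataFrame-over-Hive analytics referenced in this role; the database, table, column names, and S3 path are invented placeholders, not details from the actual project.

```python
# Sketch: read a partitioned Hive table with the DataFrame API, aggregate usage metrics, write Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("usage-metrics")
    .enableHiveSupport()          # lets Spark read tables from the Hive metastore
    .getOrCreate()
)

usage = spark.table("analytics.device_usage")           # hypothetical partitioned Hive table

daily = (
    usage.where(F.col("event_date") >= "2020-01-01")     # filter on the partition column
    .groupBy("event_date", "device_type")
    .agg(
        F.countDistinct("customer_id").alias("active_customers"),
        F.sum("session_minutes").alias("total_minutes"),
    )
)

# Land the aggregates in S3 as Parquet for downstream reporting (hypothetical bucket).
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/metrics/daily_usage/"
)
```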
Confidential
Data Analyst/ Python Developer
Responsibilities:
- Documented the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
- Recommended structural changes and enhancements to systems and databases.
- Conducted Design reviews and Technical reviews with other project stakeholders.
- Was a part of the complete life cycle of the project from the requirements to the production support.
- Created test plan documents for all back-end database modules.
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MS SQL Server.
- Worked with internal architects and assisted in the development of current and target state data architectures.
- Coordinated with business users to design new reporting needs in an appropriate, effective, and efficient way, based on the existing functionality.
- Remained knowledgeable in all areas of business operations to identify system needs and requirements.
- Performed troubleshooting and fixed and deployed many Python bug fixes for the two main applications that were a primary source of data for both customers and the internal customer service team.
- Wrote Python scripts to parse JSON documents and load the data into the database (an illustrative sketch follows this role).
- Generated various capacity planning reports (graphical) using Python packages such as NumPy and Matplotlib.
- Analyzed various generated logs and predicted/forecasted the next occurrence of events with various Python libraries.
- Performed Exploratory Data Analysis, trying to find trends and clusters.
- Built models using techniques like Regression, Tree based ensemble methods, Time Series forecasting, KNN, Clustering and Isolation Forest methods.
- Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Extensively performed large data reads/writes to and from CSV and Excel files using pandas.
- Tasked with maintaining RDDs using Spark SQL.
- Communicated and coordinated with other departments to collect business requirements.
- Created Autosys batch processes to fully automate the model to pick the latest and best-fitting bond for that market.
- Created a framework using Plotly, Dash, and Flask for visualizing trends and understanding patterns for each market using historical data.
- Used Python APIs for extracting daily data from multiple vendors.
- Used Spark and Spark SQL for data integrations and manipulations. Worked on a POC for creating a Docker image on Azure to run the model.
Environment: SQL, SQL Server 2012, MS Office, MS Visio, Jupyter, R 3.1.2, Python, SSRS, SSIS, SSAS, MongoDB, HBase, HDFS, Hive, Pig, SQL Server Management Studio, Business Intelligence Development Studio.
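A minimal sketch of the JSON-parsing and database-load step referenced in this role; the file name, table, columns, and the use of SQLite as the target database are assumptions for illustration only.

```python
# Sketch: parse a vendor JSON feed and bulk-insert the records into a database table.
import json
import sqlite3  # stand-in database driver; the actual target database is not named here

with open("vendor_feed.json") as fh:
    records = json.load(fh)  # assumes the file holds a JSON array of objects

rows = [
    (r.get("trade_id"), r.get("symbol"), float(r.get("price", 0)))
    for r in records
]

conn = sqlite3.connect("analytics.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS trades (trade_id TEXT, symbol TEXT, price REAL)"
)
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)
conn.commit()
conn.close()
```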